K-means clustering is one of the most popular unsupervised learning techniques (unsupervised means there is no “right answer”: the data hasn’t been labelled).
It partitions the dataset into distinct, non-overlapping clusters, where each data point belongs to exactly one cluster.
Each cluster is represented by its center (i.e., its centroid), which is the mean of the points assigned to that cluster.
(I have also read that k-means clustering can be helpful for supervised learning; the example I saw used the clusters with a random forest.)
It is usually applied to data that is numeric and continuous and has a relatively small number of dimensions.
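To make the centroid idea concrete, here is a minimal sketch of the assign-then-update loop behind k-means (often called Lloyd's algorithm). The helper name `simple_kmeans` and its `max_iter` argument are hypothetical and exist only for illustration; in practice the built-in `kmeans()` from base R's stats package is the better choice.

```r
# Illustrative only: simple_kmeans is a hypothetical helper, not a package function.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k distinct rows chosen at random as the initial centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")  # index of the nearest centroid
    if (all(new_assignment == assignment)) break          # nothing moved: converged
    assignment <- new_assignment
    # Update step: each centroid becomes the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

set.seed(1)  # assumption: any fixed seed, only so the run is repeatable
fit <- simple_kmeans(scale(USArrests), k = 4)
table(fit$cluster)  # number of states in each cluster
```

Because the starting centroids are random, different seeds can give different clusterings, which is why `kmeans()` offers the `nstart` argument to try several starts and keep the best one.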
Example areas of use:
- Customer Segmentation
- Document Classification
- Identifying crime localities
- Insurance fraud detection
- Cyber-Profiling criminals
https://uc-r.github.io/kmeans_clustering
library(tidyverse) # data manipulation
library(cluster) # clustering algorithms
library(factoextra) # clustering algorithms & visualization
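library(plotly) # interactive 3D plots

df <- USArrests
df <- na.omit(df) # drop rows with missing values
df <- scale(df)   # standardise each variable to mean 0, sd 1

# Elbow method: total within-cluster sum of squares for k = 1..kmax
kmax <- 10
WSSus_arrests <- sapply(1:kmax, function(k) kmeans(df, centers = k, nstart = 10)$tot.withinss)
plot(1:kmax, WSSus_arrests, type = 'b', xlab = 'k', ylab = 'Total wss')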
abline(v = 4, lty = 2) # mark the chosen k = 4 on the elbow plot

k4 <- kmeans(df, centers = 4, nstart = 25)

df %>%
  as_tibble() %>%
  mutate(cluster = k4$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(UrbanPop, Murder, color = factor(cluster), label = state)) +
  geom_text()

Alaska, Arkansas and Kentucky are all quite similar in their Murder and UrbanPop values, so they must differ on other variables. Let's check.
df %>%
  as_tibble() %>%
  mutate(cluster = k4$cluster,
         state = row.names(USArrests)) %>%
  ggplot(aes(Assault, Murder, color = factor(cluster), label = state)) +
  geom_text()

df <- df %>%
  as_tibble() %>%
  mutate(cluster = k4$cluster,
         state = row.names(USArrests))

# factor(cluster) gives one discrete colour per cluster;
# text = ~state maps the state column to the hover label (a quoted "state" would show the literal word)
plot_ly(df, x = ~UrbanPop, y = ~Murder, z = ~Assault,
        color = ~factor(cluster), type = "scatter3d", mode = "markers", text = ~state)
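
Another way to check how the clusters differ, alongside the plots, is to average each original (unscaled) variable within each cluster. This is a minimal sketch that refits the same 4-cluster model so it runs on its own; the seed value and the name `cluster_means` are assumptions for illustration, and cluster numbers may differ from run to run.

```r
# Refit the 4-cluster model on the scaled data
set.seed(123)  # assumption: any fixed seed, only for reproducibility
k4 <- kmeans(scale(USArrests), centers = 4, nstart = 25)

# Mean of each original (unscaled) variable within each cluster
cluster_means <- aggregate(USArrests, by = list(cluster = k4$cluster), FUN = mean)
print(cluster_means)

# Which clusters did the three similar-looking states land in?
k4$cluster[c("Alaska", "Arkansas", "Kentucky")]
```

Reading the table row by row shows which variables separate the clusters, which is harder to judge from a two-variable scatter plot.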